decoder module to low bits, i.e., (1)+(2)+(3), brings the most significant accuracy drop among all parts of the DETR methods, up to 2.1% in the 3-bit DETR-R50. At the same time, the other parts of DETR show comparable robustness to quantization. Consequently, the critical problem in improving quantized DETR methods is restoring the information in the MHA modules after quantization. Other qualitative results in Fig. 2.8 and Fig. 2.9 also indicate that the degraded information representation is the main obstacle to a better quantized DETR.
2.4.3 Information Bottleneck of Q-DETR
To address the information distortion of the quantized DETR, we aim to improve the representation capacity of the quantized network within a knowledge distillation framework. Generally, we utilize a real-valued DETR as the teacher and a quantized DETR as the student, distinguished by the superscripts $T$ and $S$.
Our Q-DETR pursues the best trade-off between performance and compression, which is precisely the goal of the information bottleneck (IB) method: quantifying the mutual information that an intermediate layer retains about the input (less is better) and about the desired output (more is better) [210, 223]. In our case, the intermediate layer comes from the student, while the desired output includes the ground-truth labels as well as the queries of the teacher used for distillation. Thus, the objective of our Q-DETR is:
\begin{equation}
\min_{\theta^S} \; I(X; E^S) - \beta I(E^S, q^S; y^{GT}) - \gamma I(q^S; q^T),
\tag{2.27}
\end{equation}
where $q^T$ and $q^S$ represent the queries in the teacher and student DETR methods as defined in Eq. (2.26); $\beta$ and $\gamma$ are Lagrange multipliers [210]; $\theta^S$ denotes the parameters of the student; and $I(\cdot)$ returns the mutual information between two input variables. Minimizing the first term $I(X; E^S)$ reduces the information shared between the input and the visual features $E^S$, extracting task-oriented hints [240]. Maximizing the second term $I(E^S, q^S; y^{GT})$ increases the information shared between the extracted visual features and the ground-truth labels for better object detection. These two terms are readily handled by common network training and detection loss constraints, such as proposal classification and coordinate regression.
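To make the composition of Eq. (2.27) concrete, the following is a minimal PyTorch-style sketch of how the three terms might map onto training losses: the first two terms are realized by standard detection losses, and the third by a query-distillation term. The function names, the MSE surrogate for $I(q^S; q^T)$, and the omission of DETR's Hungarian matching are simplifying assumptions for illustration, not the exact Q-DETR losses.

```python
import torch
import torch.nn as nn

def detection_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    # Simplified surrogate for DETR's set-prediction losses (proposal
    # classification + coordinate regression), covering the first two
    # terms of Eq. (2.27); Hungarian matching is omitted here.
    cls_loss = nn.functional.cross_entropy(pred_logits, tgt_labels)
    box_loss = nn.functional.l1_loss(pred_boxes, tgt_boxes)
    return cls_loss + box_loss

def query_distill_loss(q_student, q_teacher):
    # Illustrative surrogate for maximizing I(q^S; q^T): pull the quantized
    # student queries toward the real-valued teacher queries.
    return nn.functional.mse_loss(q_student, q_teacher.detach())

def total_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
               q_student, q_teacher, gamma=1.0):
    # gamma plays the role of the Lagrange multiplier on the third term.
    return (detection_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes)
            + gamma * query_distill_loss(q_student, q_teacher))
```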
The core issue of this work is the third term $I(q^S; q^T)$, which addresses the information distortion in the student query by introducing the teacher query as prior knowledge. To accomplish this goal, we first expand the third term and reformulate it as:
\begin{equation}
I(q^S; q^T) = H(q^S) - H(q^S \mid q^T),
\tag{2.28}
\end{equation}
where $H(q^S)$ is the information entropy of the student query, which should be maximized, while $H(q^S \mid q^T)$ is the conditional entropy, which should be minimized. It is challenging to optimize the maximization and minimization terms simultaneously. Instead, we compromise by reformulating Eq. (2.28) as a bi-level problem [152, 46] that alternately optimizes the two terms, explicitly defined as:
\begin{equation}
\min_{\theta^S} H(q^{S*} \mid q^T), \quad \text{s.t.} \quad q^{S*} = \arg\max_{q^S} H(q^S).
\tag{2.29}
\end{equation}
Such an objective involves two sub-problems: an inner-level optimization that derives the current optimal query $q^{S*}$, and an upper-level optimization that conducts knowledge transfer from the teacher to the student. Below, we show that the two sub-problems can be solved during forward and backward network propagation, respectively.
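As a rough illustration of this alternation, the sketch below approximates the inner level by rectifying the student query statistics in the forward pass (a Gaussian entropy surrogate), and the upper level by a conditional-entropy surrogate minimized in the backward pass. The function names, the statistics rectification, and the Gaussian assumptions are hypothetical choices for exposition, not the exact Q-DETR formulation.

```python
import torch
import torch.nn as nn

def inner_level(q_student, eps=1e-6):
    # Inner level (forward propagation): derive q^S* by rectifying the
    # student query statistics so that a Gaussian entropy surrogate,
    # H(q^S) ~ 0.5 * sum(log var), is not collapsed by quantization.
    mean = q_student.mean(dim=0, keepdim=True)
    std = q_student.std(dim=0, keepdim=True) + eps
    return (q_student - mean) / std  # unit-variance queries raise the surrogate

def upper_level(q_student_star, q_teacher):
    # Upper level (backward propagation): minimize a surrogate of the
    # conditional entropy H(q^S*|q^T); assuming a Gaussian conditional
    # centered at q^T, this reduces to a squared-error distillation term.
    return nn.functional.mse_loss(q_student_star, q_teacher.detach())

# Usage inside a training step (shapes are illustrative):
q_s = torch.randn(100, 256, requires_grad=True)  # student decoder queries
q_t = torch.randn(100, 256)                      # teacher decoder queries
loss_distill = upper_level(inner_level(q_s), q_t)
loss_distill.backward()
```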